Allow nativeparse to parse source code directly #21260

Merged
ilevkivskyi merged 26 commits into python:master from bzoracler:nativeparse-source
May 21, 2026

Conversation

@bzoracler
Contributor

This is the mypy counterpart of mypyc/ast_serialize#54

@bzoracler
Contributor Author

bzoracler commented Apr 17, 2026

The current CI failure is due to the changed typing signature of ast_serialize.parse::source; this has been fixed in the corresponding PR in mypyc/ast_serialize (see changed line).

@bzoracler bzoracler force-pushed the nativeparse-source branch from c8c10dd to ac275e4 Compare April 28, 2026 02:28
@bzoracler bzoracler force-pushed the nativeparse-source branch from 444f4e9 to 149e459 Compare April 28, 2026 03:07
@bzoracler bzoracler marked this pull request as draft April 28, 2026 03:10

@bzoracler
Contributor Author

bzoracler commented Apr 28, 2026

CI failures:

  • Step Compiled with_mypyc: As before, this is fixed in https://github.com/bzoracler/ast_serialize/blob/566ddc362930a821549ca5fbb0d7d0f3bd88eb6e/ast_serialize.pyi#L26
  • These errors should be fixed using the updated binaries built from the changes in mypyc/ast_serialize#54 ("Allow parsing source code directly"):
    • E TypeError: argument 'source': 'bytes' object is not an instance of 'str'
    • E ValueError: Source parsing is not supported yet (for test_trivial_binary_data_from_string_source)
    • E ValueError: Source parsing is not supported yet (for testPackageRootMultipleParallel, testParallelRunWithSyntaxError, testCheckingStubPackagesWorksInParallelMode, and the job "Parallel tests with .*"). I believe the code path for parallel checking causes both the source code and a file name to be passed to the parsing functions. I think the tests passed before because either the source argument was not passed or the file_exists check returned False (and we fell back to the old parser when the file didn't exist).

Is it possible for CI to run a non-released version of ast_serialize?

@bzoracler bzoracler marked this pull request as ready for review April 28, 2026 04:48
ilevkivskyi pushed a commit to mypyc/ast_serialize that referenced this pull request May 17, 2026
ilevkivskyi added a commit that referenced this pull request May 17, 2026
Member

@ilevkivskyi ilevkivskyi left a comment

I have one comment for now. Also it looks like parallel type-checking is somehow broken by this.

Comment thread mypy/build.py

@bzoracler bzoracler marked this pull request as draft May 17, 2026 19:13
@bzoracler
Contributor Author

bzoracler commented May 17, 2026

@ilevkivskyi I don't quite know what's going on here. I checked out 1100800 (the commit before bumping ast-serialize to 0.5.0 on master), installed ast-serialize==0.4.0, and did this:

-     if options.native_parser:
+     if options.native_parser and source:

Parallel checking on my machine crashes with just this change (so none of the changes in this PR were applied). Tracebacks are the same as those in e.g. https://github.com/python/mypy/actions/runs/25999007558/job/76418562048. Do you have any suggestions?

Oops, "parallel checking" would not work at all in that case, my bad. I'll look at this in more depth.

@ilevkivskyi
Member

Hint: source is a required argument for parse(). Which value do you think was (and still is) passed there for the native parser, and how will your change in ast_serialize handle that?

@ilevkivskyi
Member

Btw, I added some logging, and it looks like we sometimes pass a non-empty source to parse(), which means there may be a possibility for performance optimization. Ideally we should not read a file in Python unless absolutely necessary, since it is much faster in Rust.

@ilevkivskyi
Member

Yeah, we eagerly read the file if there is only one file in the parse batch. Anyway, no need to fix it in this PR since this is a pre-existing problem, you can just fix the crash by passing an actual source (which should be None in most cases) instead of hard-coded "".
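
To illustrate why the hard-coded "" matters, here is a minimal sketch, not mypy's or ast_serialize's actual code. `parse_module` and `demo` are hypothetical names, and the "parser" merely counts lines. It assumes the convention discussed above: source=None means "read the file at path", while any string, including the empty one, is taken as the module's full contents.

```python
from __future__ import annotations

import os
import tempfile


def parse_module(path: str, source: str | None) -> int:
    """Hypothetical stand-in for a parser entry point; returns a line count.

    source=None -> the parser reads `path` itself.
    source=str  -> the string is parsed as-is and `path` is only a label.
    """
    if source is None:
        with open(path, encoding="utf-8") as f:
            source = f.read()
    return len(source.splitlines())


def demo() -> tuple[int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write("x = 1\ny = 2\n")
        # Passing None lets the parser see the real file contents...
        from_file = parse_module(path, None)
        # ...while a hard-coded "" silently parses an empty module.
        from_empty = parse_module(path, "")
    return from_file, from_empty
```

Under this assumption, a parser that newly accepts source text would treat "" as an empty module rather than as "no source given", which is consistent with the advice to pass the actual source value through.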

@bzoracler bzoracler marked this pull request as ready for review May 18, 2026 20:21
Comment thread mypy/build.py Outdated
Comment on lines +1290 to +1296
if not os.path.exists(path):
    build_error(
        "Cannot read file '{}': {}".format(
            path.replace(os.getcwd() + os.sep, ""),
            os.strerror(2),  # `errno.ENOENT`
        )
    )

Contributor Author

This is temporary, I plan on making ast_serialize surface OSError instead in a follow-up.
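
For readers unfamiliar with the message being built in the snippet above, a small standalone sketch follows. `missing_file_message` is a hypothetical helper, not part of mypy; it only reproduces the string formatting and uses the named constant errno.ENOENT in place of the literal 2.

```python
import errno
import os


def missing_file_message(path: str, cwd: str) -> str:
    """Build a 'Cannot read file' message with a cwd-relative path."""
    # Strip the working-directory prefix, as the reviewed snippet does.
    shown = path.replace(cwd + os.sep, "")
    # errno.ENOENT is the named constant behind the literal 2.
    return "Cannot read file '{}': {}".format(shown, os.strerror(errno.ENOENT))
```

The exact wording returned by os.strerror() comes from the platform, so the sketch makes no claim about the text after the colon.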

Comment thread test-data/unit/cmdline.test Outdated
@bzoracler
Contributor Author

@ilevkivskyi Appreciate the pointers. Some commentary:

  • There are a few places where source="" is hard-coded in mypy.build; I limited the change in 8e53191 to only what was needed to make the tests pass.
  • Commit b029c44 was done because I didn't want mypy.parse.parse::source: str | bytes | None, as I assume this parsing function is now used out in the wild.

@bzoracler bzoracler requested a review from ilevkivskyi May 18, 2026 20:35
Member

@ilevkivskyi ilevkivskyi left a comment

This is not ready.

In general, I feel like the ability to parse source should simplify things, not vice versa. A bunch of the logic was added solely for the purpose of diverting to the old parser in cases where there is no file.

Let's give this one more iteration (or I can simply do this myself).

Comment thread mypy/build.py Outdated
Comment thread mypy/build.py Outdated
Comment thread mypy/build.py Outdated
            path.replace(os.getcwd() + os.sep, ""),
            os.strerror(2),  # `errno.ENOENT`
        )
    )

Member

I don't think this is a good place for this check. This is executed in a thread, instead it should be done before parsing, to match existing logic.

Comment thread mypy/parse.py Outdated
Comment thread test-data/unit/cmdline.test Outdated
@bzoracler bzoracler marked this pull request as draft May 18, 2026 23:57

Comment thread mypy/build.py Outdated
if post_parse:
    self.post_parse_all(states)
# This duplicates a bit of logic from State.parse_file(). This is done to
# optimize handling of states parsed in parallel.

Contributor Author

I've just copied the previous contents of def parse_parallel straight here, as I don't think State.parse_file() can be refactored very simply so that parallel parsing uses the same logic, even with removing the previous sequential states handling.

Comment thread mypy/build.py Outdated
Comment on lines +1283 to +1289
# Handle fake `__init__.py` files due to `--package-root`
if (
    (source is None)
    and (os.path.dirname(path) in self.fscache.fake_package_cache)
    and (os.path.basename(path) == "__init__.py")
):
    source = ""
Contributor Author

@bzoracler bzoracler May 19, 2026

Substitutes previous handling of file_exists = self.fscache.exists(path, real_only=True) in the same method.

Member

I think you should not need this if you follow my suggestion in the first comment above.

@bzoracler bzoracler marked this pull request as ready for review May 19, 2026 05:36
@bzoracler bzoracler requested a review from ilevkivskyi May 19, 2026 05:45
Member

@ilevkivskyi ilevkivskyi left a comment

OK, this is moving in the right direction. I have a few more comments.

Comment thread mypy/build.py Outdated
            state.xpath.replace(os.getcwd() + os.sep, ""),
            os.strerror(2),  # `errno.ENOENT`
        )
    )

Member

Hm, this is a bit annoying. I guess it is better to keep the real_only parameter, then you will be able to write here:

if not self.fscache.exists(state.xpath, real_only=True):
    state.source = state.get_source()

This way you will not need this, and also will be able to remove the ugly check below.
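
The idea behind the suggestion can be sketched in isolation. The code below is a toy model under stated assumptions, not mypy's FileSystemCache or State: `ToyFsCache`, `fake_package_dirs`, and `source_for_parser` are hypothetical names. It assumes that a cache may report a synthetic `__init__.py` as existing (as with `--package-root`), and that real_only=True restricts the check to files actually on disk.

```python
from __future__ import annotations

import errno
import os


class ToyFsCache:
    """Hypothetical cache that can pretend some `__init__.py` files exist."""

    def __init__(self, fake_package_dirs: set[str]) -> None:
        self.fake_package_dirs = fake_package_dirs

    def exists(self, path: str, real_only: bool = False) -> bool:
        if os.path.exists(path):
            return True
        if real_only:
            return False
        dirname, basename = os.path.split(path)
        return basename == "__init__.py" and dirname in self.fake_package_dirs


def source_for_parser(cache: ToyFsCache, path: str) -> str | None:
    """Return None to let the parser read `path`, or text for synthetic files."""
    if cache.exists(path, real_only=True):
        return None  # real file: leave the reading to the parser
    if cache.exists(path):
        return ""  # fake package marker: supply empty contents ourselves
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
```

With a single real_only check up front, both the missing-file error and the special case for fake `__init__.py` files are handled before any parsing starts, which appears to be the simplification the review is asking for.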

Contributor Author

I've generally avoided trying fixes that involved mutating state.source = outside of methods of State, but if that's ok, I've applied the suggestion.

Member

Yes, it is OK in this case as it will only affect fake/synthetic files.

Comment thread mypy/build.py
    sequential_states, parallel_states
)
for state in parallel_parsed_states:
    # New parser only returns serialized ASTs

Member

You modified this comment while copying, why?

Contributor Author

The original reason was that "parallelize only those parts of the code that can be parallelized efficiently." reads, to me, out of context when parse_parallel no longer handles a variable called sequential_states. But I've restored the original comment as-is.

Member

It is not really about sequential states, it is about the general logic: in parse_file() we do (roughly): pre-parse, parse, post-parse. In parse_all() we do: pre-parse sequentially, parse in parallel, post-parse sequentially. This is done like this to avoid overhead of context switches in code that holds the GIL (pre-parse and post-parse).
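
The three-phase shape described above can be sketched as follows. This is an illustration only, not mypy's parse_all(): the function bodies are placeholders, the names are hypothetical, and the sketch assumes the real parse step releases the GIL (as a native parser could), so that only that step benefits from threads.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def pre_parse(path: str) -> str:
    # Cheap bookkeeping that holds the GIL: keep it on the main thread.
    return path.strip()


def parse(path: str) -> tuple[str, int]:
    # Placeholder for the expensive call; a native parser is assumed to
    # release the GIL here, which is what makes threads worthwhile.
    return path, len(path)


def post_parse(parsed: list[tuple[str, int]]) -> dict[str, int]:
    # More GIL-bound bookkeeping, again done sequentially.
    return dict(parsed)


def parse_all(paths: list[str]) -> dict[str, int]:
    prepared = [pre_parse(p) for p in paths]      # 1. sequential
    with ThreadPoolExecutor() as pool:
        parsed = list(pool.map(parse, prepared))  # 2. parallel, input order kept
    return post_parse(parsed)                     # 3. sequential
```

Running the first and last phases inside the worker threads would make those threads contend for the GIL without gaining any parallelism, which is the overhead the comment refers to.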

Comment thread mypy/build.py
@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@bzoracler bzoracler requested a review from ilevkivskyi May 20, 2026 21:17
Member

@ilevkivskyi ilevkivskyi left a comment

LG, thanks!

@ilevkivskyi ilevkivskyi merged commit 6f0e77b into python:master May 21, 2026
25 checks passed
@ilevkivskyi
Member

@bzoracler If you are interested in working more in this direction, I think #21222 and #21514 are good next issues. (Btw it could make sense to fix them in the same PR, as they are somewhat related).

#21515 is another important thing, but it is waiting on input from @JukkaL

@bzoracler bzoracler deleted the nativeparse-source branch May 21, 2026 03:17
@bzoracler
Contributor Author

I'll take a look at #21222 and #21514 in the next few days.

#21515 looks quite involved so I'll pass on this for now, but I'm keen on seeing it resolved as it looks quite useful for plugins*, so if the work is green-lighted I may submit a PR in the next few weeks if I don't see any work done on it already.

*The Ruff parser allows parsing string forward expressions with leading spaces out of the box; Python's ast.parse() doesn't, so I've so far resorted to hacks like surrounding string contents representing expressions with () just to get the node column numbers right.
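
The workaround mentioned in the footnote can be sketched with the standard library alone. This is a guess at the general shape of such a hack, not the commenter's actual plugin code; `plain_parse_fails` and `parse_expr_with_columns` are hypothetical helper names, and the column fix-up shown is only valid for single-line ASCII expressions.

```python
import ast


def plain_parse_fails(text: str) -> bool:
    """Return True if ast.parse() rejects `text` as an expression."""
    try:
        ast.parse(text, mode="eval")
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return True
    return False


def parse_expr_with_columns(text: str) -> list[tuple[str, int]]:
    """Parse `text` wrapped in parentheses; report names with original columns."""
    tree = ast.parse("(" + text + ")", mode="eval")
    # Subtract 1 to undo the shift caused by the added "(".
    return [
        (node.id, node.col_offset - 1)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
    ]
```

For the input "  List[int]" (two leading spaces), the plain parse is rejected as an unexpected indent, while the wrapped parse reports List at column 2 and int at column 7, matching their positions in the original string.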
